An Archery Game Analysis
This document is the final project for the STAT 570 course, focusing on data handling and visualization tools. It is essential to acknowledge that minor errors may be present, and the methods employed may not necessarily reflect the optimal approach for this data set.
Altan Şener altansener98@gmail.com
Abdullah Mert Çelikkol mertcelikkol10@gmail.com
Alican Aktağ alcnaktag@gmail.com
INTRODUCTION
The data used in this study was gathered from a mobile game that runs on the Android operating system.
The data set has a total of 24762 observations and 66 variables. In the game, there is a spinning ball in the middle, and the player throws arrows at it by touching the screen; if an arrow hits another arrow instead of the ball, the player has to repeat the level.
There are 200 levels, and each level gradually gets more difficult. The value of the data set derives from its thorough collection, ensuring a wealth of detail. Furthermore, the substantial sample size provides sufficient grounds for meaningful interpretation, yet it is not so large as to cause memory problems, so no sub-sample has been taken. The original data set is in JSON format and has to be cleaned: the variable names have to be adjusted, and the column types need to be corrected.
This is what the JSON data set looks like:
At first glance, it is not easy to understand the data. First, we convert this data to CSV format.
Now it looks better :) Doesn’t it?
The game has been available to users for a while (since 2012), and a lot of data was collected during this time: which country the users are in, which phone they use, how much time players spent on each level, their success rates in the levels, and the Android version they used.
Based on what we know, we wondered about some things and wanted to visualize them and share the results with you.
A few things we will cover:
Research Question 1:
- How does the distribution of players vary across continents and countries in the game, and are there noticeable patterns in user engagement based on continent?
Research Question 2:
- What is the distribution of the maximum level attained by users in the game, analyzed in conjunction with their respective continents?
- Are there significant variations in the maximum level achieved, and do these patterns differ across continents?
Research Question 3:
- How does user engagement differ across mobile device categories in the game? Furthermore, are there variations in the maximum levels achieved by users within each category?
- This analysis seeks to uncover patterns and disparities in user behavior based on the distinction between device categories.
Research Question 4:
- What is the distribution of successful trial frequencies at different levels within a mobile game?
The answers and visualizations to these questions will give us a better understanding of the game.
This is a photo of the 43rd level of the game. The screenshot is from one of our phones (Altan’s, I guess); I got to the 43rd level in around 45 minutes, so you can see that the game isn’t that hard.
DATA CLEANING
Disclaimer
This notebook is, to our knowledge, the first example of Firebase Analytics data being flattened in the R programming language, and the solutions might not be the most optimal ones.
Preparing JSONL (the L stands for newline-delimited) data to be tabular can sometimes be a fun journey, especially when you are dealing with Firebase Analytics data. What is Firebase? It is one of the most widely used analytics and tracking platforms for mobile applications. Below is what Firebase Analytics data looks like when it is partially cleaned:
Each row is an event. All events have a timestamp and several fields such as country, version, etc. More importantly, there is a key and a value, where the value can be an integer, float, or string, but never two different types at the same time for a given key. The above is a summary of generic, mostly clean Firebase Analytics data, and the following picture is from Google Cloud BigQuery:
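As a toy illustration (hypothetical values, not the real export), for each key exactly one of the two value columns is filled, so they can later be coalesced into a single column:

```r
# Toy event_params rows: for every key, exactly one of int_value /
# string_value is NA (this mirrors the Firebase layout, not real data).
params <- data.frame(
  key          = c("ga_session_id", "firebase_event_origin"),
  int_value    = c("1702888810", NA),
  string_value = c(NA, "auto"),
  stringsAsFactors = FALSE
)
# Coalesce: take int_value when present, otherwise string_value.
params$value <- ifelse(is.na(params$int_value),
                       params$string_value, params$int_value)
params[, c("key", "value")]
```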
Trick 1
You can use the here() function from the (guess the name) here package in R to set the working directory to your script’s location, so people who use your code don’t have to change any lines to test it out.
You can now check whether here::here() has worked:
getwd()
As we have previously shown you, our data is in JSON format. Let’s load it and check the head of the data to see what it looks like:
library(jsonlite) ## no error messages here thanks to the "#| warning: false" option!
library(tidyverse)
jsonl_data <- stream_in(file("data.json"),verbose = FALSE)
head(jsonl_data,2) event_date event_timestamp event_name
1 20231221 1703146325195000 app_remove
2 20231221 1703134161235001 screen_view
event_params
1 ga_session_id, ga_session_number, firebase_event_origin, 1702888810, 10, NA, NA, NA, auto
2 entrances, ga_session_number, ga_session_id, firebase_screen_id, firebase_event_origin, engaged_session_event, firebase_screen_class, 1, 52, 1703134160, -8779596609919170096, NA, 1, NA, NA, NA, NA, NA, auto, NA, UnityPlayerActivity
event_bundle_sequence_id event_server_timestamp_offset
1 15 1383899326014
2 100 494
user_pseudo_id privacy_info.analytics_storage
1 eae0e04fa3fa69ef646baaeaa716ea6b Yes
2 74b842a52de12e4cf74034f998ee98cf Yes
privacy_info.ads_storage privacy_info.uses_transient_token
1 Yes No
2 Yes No
user_properties
1 ga_session_id, ga_session_number, first_open_time, 1702888810, 10, 1701284400000, 1702888810463000, 1702888810463000, 1701284050034000
2 ga_session_number, ga_session_id, first_open_time, 52, 1703134160, 1698397200000, 1703134160741000, 1703134160741000, 1698396193166000
user_first_touch_timestamp device.category device.mobile_brand_name
1 1701284050034000 mobile Samsung
2 1698396193166000 mobile Samsung
device.mobile_model_name device.mobile_marketing_name
1 SM-G7102 Galaxy Grand 2
2 SM-J110H Galaxy J1
device.mobile_os_hardware_model device.operating_system
1 SM-G7102 Android
2 SM-J110H Android
device.operating_system_version device.advertising_id
1 Android 4.4.2 f28a3de1-340f-4d84-a491-ae5bc5c66a8a
2 Android 4.4.4 6d58a8cf-8508-45cb-9f0a-41fbbc14d0c7
device.language device.is_limited_ad_tracking device.time_zone_offset_seconds
1 ar-ae Yes 7200
2 ar-ae Yes 7200
geo.city geo.country geo.continent geo.region
1 Kafr el-Sheikh Egypt Africa Kafr El-Sheikh Governorate
2 Jordan Asia Amman Governorate
geo.sub_continent geo.metro app_info.id app_info.version
1 Northern Africa (not set) com.elakerem.focus 2.0.22
2 Western Asia (not set) com.elakerem.focus 2.0.22
app_info.firebase_app_id app_info.install_source
1 1:2474473662:android:5047021a790dce42eb06ef com.android.vending
2 1:2474473662:android:5047021a790dce42eb06ef com.android.vending
traffic_source.name traffic_source.medium traffic_source.source stream_id
1 (direct) (none) (direct) 2758285888
2 (direct) (none) (direct) 2758285888
platform items is_active_user event_previous_timestamp
1 ANDROID NULL FALSE <NA>
2 ANDROID NULL TRUE 1703035993871001
collected_traffic_source.manual_source collected_traffic_source.manual_medium
1 <NA> <NA>
2 <NA> <NA>
Overall, it is hard to interpret, because we have key/value formats, data frames of data frames, lists of data frames, and all sorts of weird things. Let’s check the data again without going into the nested levels, using the max.level = 1 argument:
str(jsonl_data,max.level = 1)'data.frame': 24762 obs. of 20 variables:
$ event_date : chr "20231221" "20231221" "20231221" "20231221" ...
$ event_timestamp : chr "1703146325195000" "1703134161235001" "1703134235541004" "1703134279049008" ...
$ event_name : chr "app_remove" "screen_view" "level_end" "user_engagement" ...
$ event_params :List of 24762
$ event_bundle_sequence_id : chr "15" "100" "100" "100" ...
$ event_server_timestamp_offset: chr "1383899326014" "494" "494" "494" ...
$ user_pseudo_id : chr "eae0e04fa3fa69ef646baaeaa716ea6b" "74b842a52de12e4cf74034f998ee98cf" "74b842a52de12e4cf74034f998ee98cf" "74b842a52de12e4cf74034f998ee98cf" ...
$ privacy_info :'data.frame': 24762 obs. of 3 variables:
$ user_properties :List of 24762
$ user_first_touch_timestamp : chr "1701284050034000" "1698396193166000" "1698396193166000" "1698396193166000" ...
$ device :'data.frame': 24762 obs. of 11 variables:
$ geo :'data.frame': 24762 obs. of 6 variables:
$ app_info :'data.frame': 24762 obs. of 4 variables:
$ traffic_source :'data.frame': 24762 obs. of 3 variables:
$ stream_id : chr "2758285888" "2758285888" "2758285888" "2758285888" ...
$ platform : chr "ANDROID" "ANDROID" "ANDROID" "ANDROID" ...
$ items :List of 24762
$ is_active_user : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
$ event_previous_timestamp : chr NA "1703035993871001" "1703134211021004" "1703036243245008" ...
$ collected_traffic_source :'data.frame': 24762 obs. of 2 variables:
So we have some lists, some data frames, some character columns, and some logical ones; we got a beautiful soup of data types. Recall the event_params column from the example picture.
event_params has key and value nestings, and value in turn has string/float/int/double nestings, all in different formats. Let’s take a look at event_params with the str function:
str(jsonl_data$event_params,list.len=3)List of 24762
$ :'data.frame': 3 obs. of 2 variables:
..$ key : chr [1:3] "ga_session_id" "ga_session_number" "firebase_event_origin"
..$ value:'data.frame': 3 obs. of 2 variables:
.. ..$ int_value : chr [1:3] "1702888810" "10" NA
.. ..$ string_value: chr [1:3] NA NA "auto"
$ :'data.frame': 7 obs. of 2 variables:
..$ key : chr [1:7] "entrances" "ga_session_number" "ga_session_id" "firebase_screen_id" ...
..$ value:'data.frame': 7 obs. of 2 variables:
.. ..$ int_value : chr [1:7] "1" "52" "1703134160" "-8779596609919170096" ...
.. ..$ string_value: chr [1:7] NA NA NA NA ...
$ :'data.frame': 9 obs. of 2 variables:
..$ key : chr [1:9] "ga_session_number" "level" "firebase_screen_id" "firebase_event_origin" ...
..$ value:'data.frame': 9 obs. of 2 variables:
.. ..$ int_value : chr [1:9] "52" "1" "-8779596609919170096" NA ...
.. ..$ string_value: chr [1:9] NA NA NA "app" ...
[list output truncated]
We have a complicated format; here is a demonstration of how to access event_params and its sub-parts:
class(jsonl_data) ## Whole data class [1] "data.frame"
class(jsonl_data$event_params) ## Event params class [1] "list"
jsonl_data$event_params[[1]] ## access first row's event_params key value.int_value value.string_value
1 ga_session_id 1702888810 <NA>
2 ga_session_number 10 <NA>
3 firebase_event_origin <NA> auto
class(jsonl_data$event_params[[1]]) ## it's class [1] "data.frame"
jsonl_data$event_params[[1]][1] ## how to access event_params$key key
1 ga_session_id
2 ga_session_number
3 firebase_event_origin
class(jsonl_data$event_params[[1]][1]) ## it's class[1] "data.frame"
jsonl_data$event_params[[1]][2] ## how to access event_params$value value.int_value value.string_value
1 1702888810 <NA>
2 10 <NA>
3 <NA> auto
class(jsonl_data$event_params[[1]][2]) ## it's class[1] "data.frame"
We have a data frame jsonl_data; it has a list event_params; the list is made of data frames; and in each data frame, we have a key column and a nested data frame named value, which has two columns, int_value and string_value.
Here is the issue: our data also has inconsistencies. Below, take a look at two different “data frames” under event_params:
jsonl_data$event_params[[4983]] key string_value
1 previous_os_version 11
2 firebase_event_origin auto
jsonl_data$event_params[[1]] key value.int_value value.string_value
1 ga_session_id 1702888810 <NA>
2 ga_session_number 10 <NA>
3 firebase_event_origin <NA> auto
Let’s begin cleaning. First, there were some non-data-frame list objects inside event_params; we will iterate over event_params and convert them to data frames.
list_of_dfs=list()
for(i in 1:nrow(jsonl_data)){
temp_df= as.data.frame(jsonl_data$event_params[[i]])
list_of_dfs[[i]] <- temp_df
}
class(list_of_dfs)[1] "list"
list_of_dfs[[1]] key value.int_value value.string_value
1 ga_session_id 1702888810 <NA>
2 ga_session_number 10 <NA>
3 firebase_event_origin <NA> auto
At the next step, recall that we had an int value and a string value for each key, with one of the two always being null. Since one is always null, we can coalesce them into a single column. However, we have two types of objects inside event_params (there were three before the loop above): the case where event_params has a key and a nested value with two sub-columns, and the case where event_params has no nested value column but instead a plain string_value column. We can iterate over the list in a for loop: in the first case, coalesce the two sub-columns and unnest them into a single value column; in the second case, simply rename the column. Just for fun, we will use tryCatch, because why not learn it while working in R. Here is the general idea of tryCatch:
Try
    To do something
Except (if it fails)
    Do something else instead
for (i in seq_along(list_of_dfs)) {
tryCatch({
if ("value" %in% names(list_of_dfs[[i]])) {
if (is.list(list_of_dfs[[i]]$value)) {
list_of_dfs[[i]]$value <- ifelse(
!is.na(list_of_dfs[[i]]$value$int_value),
as.character(list_of_dfs[[i]]$value$int_value),
as.character(list_of_dfs[[i]]$value$string_value)
)
list_of_dfs[[i]]$value <- as.character(list_of_dfs[[i]]$value)
list_of_dfs[[i]] <- list_of_dfs[[i]][, !(names(list_of_dfs[[i]]) %in% c("int_value", "string_value"))]
}
}
}, error = function(e) {
# If an error occurs, rename the second column to "value"
if (length(names(list_of_dfs[[i]])) >= 2) {
new_temp_df= cbind.data.frame(list_of_dfs[[i]][[1]],c(list_of_dfs[[i]][[2]]))
names(new_temp_df) <- c("key","value")
list_of_dfs[[i]]<<- new_temp_df
}
})
}
list_of_dfs[[1]] key value
1 ga_session_id 1702888810
2 ga_session_number 10
3 firebase_event_origin auto
Note that tryCatch and error handling are usually much slower than if statements and should primarily be used for cases that can’t be predicted or handled with regular checks. That said, it can be faster than an extremely complex if check.
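A minimal, self-contained tryCatch sketch (a hypothetical function, not from the project code): the error handler supplies a fallback value when the wrapped expression fails.

```r
# safe_divide() returns NA instead of stopping when the input is bad.
safe_divide <- function(x, y) {
  tryCatch(
    x / y,
    error = function(e) NA_real_  # runs only if x / y throws an error
  )
}
safe_divide(10, 2)    # works normally
safe_divide(10, "a")  # non-numeric input errors; the handler returns NA
```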
At the next step, we flatten event_params. Instead of having keys and values, what if every unique key were a column of the data set, with the values under it, and if a row has no element for a specific key we just keep a null? This would make it easier for anyone else working on this data later.
Luckily, the tidyverse ensures we don’t have to write a complex loop here, since we have already dealt with the inconsistencies in our data. But first, let’s get these things out of the list of data frames and into a single data frame; we will use the bind_rows function, again from the tidyverse:
combined_df <- bind_rows(list_of_dfs, .id = "df_id")
head(combined_df) df_id key value
1 1 ga_session_id 1702888810
2 1 ga_session_number 10
3 1 firebase_event_origin auto
4 2 entrances 1
5 2 ga_session_number 52
6 2 ga_session_id 1703134160
Now all we have to do is reshape the combined long-format table to wide format for each ID; pivot_wider will automatically fill with nulls the cases where a specific key is not used in that row.
flattened_df <- combined_df %>%
pivot_wider(names_from = key, values_from = value)
flattened_df <- flattened_df[, -1]
head(flattened_df )# A tibble: 6 × 24
ga_session_id ga_session_number firebase_event_origin entrances
<chr> <chr> <chr> <chr>
1 1702888810 10 auto <NA>
2 1703134160 52 auto 1
3 1703134160 52 app <NA>
4 1703134160 52 auto <NA>
5 1703190214 53 auto 1
6 1703190214 53 app <NA>
# ℹ 20 more variables: firebase_screen_id <chr>, engaged_session_event <chr>,
# firebase_screen_class <chr>, level <chr>, success <chr>, value <chr>,
# engagement_time_msec <chr>, session_engaged <chr>,
# firebase_previous_class <chr>, firebase_previous_id <chr>,
# update_with_analytics <chr>, system_app <chr>,
# previous_first_open_count <chr>, system_app_update <chr>,
# firebase_conversion <chr>, source <chr>, medium <chr>, …
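As a self-contained illustration of this long-to-wide reshape (a toy table with hypothetical values, using only base R rather than pivot_wider):

```r
# Toy long-format key/value table, analogous to combined_df:
long <- data.frame(
  df_id = c(1, 1, 2),
  key   = c("level", "success", "level"),
  value = c("1", "0", "2"),
  stringsAsFactors = FALSE
)
# Base-R counterpart of pivot_wider(): one column per unique key,
# with NA wherever a df_id has no entry for that key.
wide <- reshape(long, idvar = "df_id", timevar = "key", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))
wide  # df_id 2 has no "success" entry, so it becomes NA
```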
We can now cbind this into our main data frame, and also drop some irrelevant columns as well as the column we have just flattened:
final_df= cbind.data.frame(jsonl_data,flattened_df)
final_df2= final_df |> select(-event_params,-user_properties,-items)
Let’s take a look at our data again:
str(final_df2)'data.frame': 24762 obs. of 41 variables:
$ event_date : chr "20231221" "20231221" "20231221" "20231221" ...
$ event_timestamp : chr "1703146325195000" "1703134161235001" "1703134235541004" "1703134279049008" ...
$ event_name : chr "app_remove" "screen_view" "level_end" "user_engagement" ...
$ event_bundle_sequence_id : chr "15" "100" "100" "100" ...
$ event_server_timestamp_offset: chr "1383899326014" "494" "494" "494" ...
$ user_pseudo_id : chr "eae0e04fa3fa69ef646baaeaa716ea6b" "74b842a52de12e4cf74034f998ee98cf" "74b842a52de12e4cf74034f998ee98cf" "74b842a52de12e4cf74034f998ee98cf" ...
$ privacy_info :'data.frame': 24762 obs. of 3 variables:
..$ analytics_storage : chr "Yes" "Yes" "Yes" "Yes" ...
..$ ads_storage : chr "Yes" "Yes" "Yes" "Yes" ...
..$ uses_transient_token: chr "No" "No" "No" "No" ...
$ user_first_touch_timestamp : chr "1701284050034000" "1698396193166000" "1698396193166000" "1698396193166000" ...
$ device :'data.frame': 24762 obs. of 11 variables:
..$ category : chr "mobile" "mobile" "mobile" "mobile" ...
..$ mobile_brand_name : chr "Samsung" "Samsung" "Samsung" "Samsung" ...
..$ mobile_model_name : chr "SM-G7102" "SM-J110H" "SM-J110H" "SM-J110H" ...
..$ mobile_marketing_name : chr "Galaxy Grand 2" "Galaxy J1" "Galaxy J1" "Galaxy J1" ...
..$ mobile_os_hardware_model: chr "SM-G7102" "SM-J110H" "SM-J110H" "SM-J110H" ...
..$ operating_system : chr "Android" "Android" "Android" "Android" ...
..$ operating_system_version: chr "Android 4.4.2" "Android 4.4.4" "Android 4.4.4" "Android 4.4.4" ...
..$ advertising_id : chr "f28a3de1-340f-4d84-a491-ae5bc5c66a8a" "6d58a8cf-8508-45cb-9f0a-41fbbc14d0c7" "6d58a8cf-8508-45cb-9f0a-41fbbc14d0c7" "6d58a8cf-8508-45cb-9f0a-41fbbc14d0c7" ...
..$ language : chr "ar-ae" "ar-ae" "ar-ae" "ar-ae" ...
..$ is_limited_ad_tracking : chr "Yes" "Yes" "Yes" "Yes" ...
..$ time_zone_offset_seconds: chr "7200" "7200" "7200" "7200" ...
$ geo :'data.frame': 24762 obs. of 6 variables:
..$ city : chr "Kafr el-Sheikh" "" "" "" ...
..$ country : chr "Egypt" "Jordan" "Jordan" "Jordan" ...
..$ continent : chr "Africa" "Asia" "Asia" "Asia" ...
..$ region : chr "Kafr El-Sheikh Governorate" "Amman Governorate" "Amman Governorate" "Amman Governorate" ...
..$ sub_continent: chr "Northern Africa" "Western Asia" "Western Asia" "Western Asia" ...
..$ metro : chr "(not set)" "(not set)" "(not set)" "(not set)" ...
$ app_info :'data.frame': 24762 obs. of 4 variables:
..$ id : chr "com.elakerem.focus" "com.elakerem.focus" "com.elakerem.focus" "com.elakerem.focus" ...
..$ version : chr "2.0.22" "2.0.22" "2.0.22" "2.0.22" ...
..$ firebase_app_id: chr "1:2474473662:android:5047021a790dce42eb06ef" "1:2474473662:android:5047021a790dce42eb06ef" "1:2474473662:android:5047021a790dce42eb06ef" "1:2474473662:android:5047021a790dce42eb06ef" ...
..$ install_source : chr "com.android.vending" "com.android.vending" "com.android.vending" "com.android.vending" ...
$ traffic_source :'data.frame': 24762 obs. of 3 variables:
..$ name : chr "(direct)" "(direct)" "(direct)" "(direct)" ...
..$ medium: chr "(none)" "(none)" "(none)" "(none)" ...
..$ source: chr "(direct)" "(direct)" "(direct)" "(direct)" ...
$ stream_id : chr "2758285888" "2758285888" "2758285888" "2758285888" ...
$ platform : chr "ANDROID" "ANDROID" "ANDROID" "ANDROID" ...
$ is_active_user : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
$ event_previous_timestamp : chr NA "1703035993871001" "1703134211021004" "1703036243245008" ...
$ collected_traffic_source :'data.frame': 24762 obs. of 2 variables:
..$ manual_source: chr NA NA NA NA ...
..$ manual_medium: chr NA NA NA NA ...
$ ga_session_id : chr "1702888810" "1703134160" "1703134160" "1703134160" ...
$ ga_session_number : chr "10" "52" "52" "52" ...
$ firebase_event_origin : chr "auto" "auto" "app" "auto" ...
$ entrances : chr NA "1" NA NA ...
$ firebase_screen_id : chr NA "-8779596609919170096" "-8779596609919170096" "-8779596609919170096" ...
$ engaged_session_event : chr NA "1" "1" "1" ...
$ firebase_screen_class : chr NA "UnityPlayerActivity" "UnityPlayerActivity" "UnityPlayerActivity" ...
$ level : chr NA NA "1" NA ...
$ success : chr NA NA "0" NA ...
$ value : chr NA NA "0" NA ...
$ engagement_time_msec : chr NA NA NA "114189" ...
$ session_engaged : chr NA NA NA NA ...
$ firebase_previous_class : chr NA NA NA NA ...
$ firebase_previous_id : chr NA NA NA NA ...
$ update_with_analytics : chr NA NA NA NA ...
$ system_app : chr NA NA NA NA ...
$ previous_first_open_count : chr NA NA NA NA ...
$ system_app_update : chr NA NA NA NA ...
$ firebase_conversion : chr NA NA NA NA ...
$ source : chr NA NA NA NA ...
$ medium : chr NA NA NA NA ...
$ campaign_info_source : chr NA NA NA NA ...
$ previous_os_version : chr NA NA NA NA ...
$ appnava_churn_prob : chr NA NA NA NA ...
Our data is now free of the chaos of event_params. We still have a few more things to do, as some columns are nested; for example, the geo column has other columns under it, like geo.country, geo.continent, and so on, and this might make future work slower.
str(final_df2$geo)'data.frame': 24762 obs. of 6 variables:
$ city : chr "Kafr el-Sheikh" "" "" "" ...
$ country : chr "Egypt" "Jordan" "Jordan" "Jordan" ...
$ continent : chr "Africa" "Asia" "Asia" "Asia" ...
$ region : chr "Kafr El-Sheikh Governorate" "Amman Governorate" "Amman Governorate" "Amman Governorate" ...
$ sub_continent: chr "Northern Africa" "Western Asia" "Western Asia" "Western Asia" ...
$ metro : chr "(not set)" "(not set)" "(not set)" "(not set)" ...
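As an aside, base R can flatten such nested data-frame columns in one call; here is a toy sketch (hypothetical values, not the project data):

```r
# A data frame with a nested data-frame column, like final_df2$geo:
nested <- data.frame(event = c("a", "b"))
nested$geo <- data.frame(country   = c("Egypt", "Jordan"),
                         continent = c("Africa", "Asia"))
# do.call(data.frame, ...) expands the nested column into top-level
# columns named geo.country and geo.continent:
flat <- do.call(data.frame, nested)
names(flat)
```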
To ensure our data is friendly for anyone, we can write a small function that detects such “data frames under our data frame” and a loop that unnests them:
is_dataframe <- function(column) {
is.data.frame(column)
}
### object to save columns which are dataframes
dataframe_cols <- c()
# Loop through each column in final_df2
for (col in colnames(final_df2)) {
if (is_dataframe(final_df2[[col]])) {
dataframe_cols <- c(dataframe_cols, col)
}
}
## loop through the nested data-frame columns, pull each one out, and combine them
combined_nested_dfs=rep(0,nrow(final_df2))
for (element in dataframe_cols){
temp_index=as.character(element)
temp_df= final_df2[[temp_index]]
combined_nested_dfs=cbind.data.frame(combined_nested_dfs,temp_df)
}
final_df2=final_df2 |> select(-dataframe_cols)
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(dataframe_cols)
# Now:
data %>% select(all_of(dataframe_cols))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
final_df3=cbind.data.frame(final_df2,combined_nested_dfs)
options(scipen=999)
write.csv(final_df3,"plsfkinwork3.csv")
We have written a lot of code to get an “end scientist” friendly rectangular data set. Let’s take a final look before we end the data prep:
str(final_df3)'data.frame': 24762 obs. of 65 variables:
$ event_date : chr "20231221" "20231221" "20231221" "20231221" ...
$ event_timestamp : chr "1703146325195000" "1703134161235001" "1703134235541004" "1703134279049008" ...
$ event_name : chr "app_remove" "screen_view" "level_end" "user_engagement" ...
$ event_bundle_sequence_id : chr "15" "100" "100" "100" ...
$ event_server_timestamp_offset: chr "1383899326014" "494" "494" "494" ...
$ user_pseudo_id : chr "eae0e04fa3fa69ef646baaeaa716ea6b" "74b842a52de12e4cf74034f998ee98cf" "74b842a52de12e4cf74034f998ee98cf" "74b842a52de12e4cf74034f998ee98cf" ...
$ user_first_touch_timestamp : chr "1701284050034000" "1698396193166000" "1698396193166000" "1698396193166000" ...
$ stream_id : chr "2758285888" "2758285888" "2758285888" "2758285888" ...
$ platform : chr "ANDROID" "ANDROID" "ANDROID" "ANDROID" ...
$ is_active_user : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
$ event_previous_timestamp : chr NA "1703035993871001" "1703134211021004" "1703036243245008" ...
$ ga_session_id : chr "1702888810" "1703134160" "1703134160" "1703134160" ...
$ ga_session_number : chr "10" "52" "52" "52" ...
$ firebase_event_origin : chr "auto" "auto" "app" "auto" ...
$ entrances : chr NA "1" NA NA ...
$ firebase_screen_id : chr NA "-8779596609919170096" "-8779596609919170096" "-8779596609919170096" ...
$ engaged_session_event : chr NA "1" "1" "1" ...
$ firebase_screen_class : chr NA "UnityPlayerActivity" "UnityPlayerActivity" "UnityPlayerActivity" ...
$ level : chr NA NA "1" NA ...
$ success : chr NA NA "0" NA ...
$ value : chr NA NA "0" NA ...
$ engagement_time_msec : chr NA NA NA "114189" ...
$ session_engaged : chr NA NA NA NA ...
$ firebase_previous_class : chr NA NA NA NA ...
$ firebase_previous_id : chr NA NA NA NA ...
$ update_with_analytics : chr NA NA NA NA ...
$ system_app : chr NA NA NA NA ...
$ previous_first_open_count : chr NA NA NA NA ...
$ system_app_update : chr NA NA NA NA ...
$ firebase_conversion : chr NA NA NA NA ...
$ source : chr NA NA NA NA ...
$ medium : chr NA NA NA NA ...
$ campaign_info_source : chr NA NA NA NA ...
$ previous_os_version : chr NA NA NA NA ...
$ appnava_churn_prob : chr NA NA NA NA ...
$ combined_nested_dfs : num 0 0 0 0 0 0 0 0 0 0 ...
$ analytics_storage : chr "Yes" "Yes" "Yes" "Yes" ...
$ ads_storage : chr "Yes" "Yes" "Yes" "Yes" ...
$ uses_transient_token : chr "No" "No" "No" "No" ...
$ category : chr "mobile" "mobile" "mobile" "mobile" ...
$ mobile_brand_name : chr "Samsung" "Samsung" "Samsung" "Samsung" ...
$ mobile_model_name : chr "SM-G7102" "SM-J110H" "SM-J110H" "SM-J110H" ...
$ mobile_marketing_name : chr "Galaxy Grand 2" "Galaxy J1" "Galaxy J1" "Galaxy J1" ...
$ mobile_os_hardware_model : chr "SM-G7102" "SM-J110H" "SM-J110H" "SM-J110H" ...
$ operating_system : chr "Android" "Android" "Android" "Android" ...
$ operating_system_version : chr "Android 4.4.2" "Android 4.4.4" "Android 4.4.4" "Android 4.4.4" ...
$ advertising_id : chr "f28a3de1-340f-4d84-a491-ae5bc5c66a8a" "6d58a8cf-8508-45cb-9f0a-41fbbc14d0c7" "6d58a8cf-8508-45cb-9f0a-41fbbc14d0c7" "6d58a8cf-8508-45cb-9f0a-41fbbc14d0c7" ...
$ language : chr "ar-ae" "ar-ae" "ar-ae" "ar-ae" ...
$ is_limited_ad_tracking : chr "Yes" "Yes" "Yes" "Yes" ...
$ time_zone_offset_seconds : chr "7200" "7200" "7200" "7200" ...
$ city : chr "Kafr el-Sheikh" "" "" "" ...
$ country : chr "Egypt" "Jordan" "Jordan" "Jordan" ...
$ continent : chr "Africa" "Asia" "Asia" "Asia" ...
$ region : chr "Kafr El-Sheikh Governorate" "Amman Governorate" "Amman Governorate" "Amman Governorate" ...
$ sub_continent : chr "Northern Africa" "Western Asia" "Western Asia" "Western Asia" ...
$ metro : chr "(not set)" "(not set)" "(not set)" "(not set)" ...
$ id : chr "com.elakerem.focus" "com.elakerem.focus" "com.elakerem.focus" "com.elakerem.focus" ...
$ version : chr "2.0.22" "2.0.22" "2.0.22" "2.0.22" ...
$ firebase_app_id : chr "1:2474473662:android:5047021a790dce42eb06ef" "1:2474473662:android:5047021a790dce42eb06ef" "1:2474473662:android:5047021a790dce42eb06ef" "1:2474473662:android:5047021a790dce42eb06ef" ...
$ install_source : chr "com.android.vending" "com.android.vending" "com.android.vending" "com.android.vending" ...
$ name : chr "(direct)" "(direct)" "(direct)" "(direct)" ...
$ medium : chr "(none)" "(none)" "(none)" "(none)" ...
$ source : chr "(direct)" "(direct)" "(direct)" "(direct)" ...
$ manual_source : chr NA NA NA NA ...
$ manual_medium : chr NA NA NA NA ...
Let’s save the results before we begin our data analysis.
write.csv(final_df3,"plsfkinwork3.csv")
DATA ANALYSIS
Let’s load our libraries. You can disable package-loading warnings with the warning = FALSE chunk option; the chunk below is code only.
setwd(here::here())
df=read.csv("plsfkinwork3.csv")
library(dplyr)
library(ggplot2)
library(purrr)
library(jsonlite)
library(tidyr)
options(scipen=999)
df=df[,-1]
First, let’s plot a histogram of the session duration of users. There are two ways to calculate it: the first is to take the start timestamp of a session and its end timestamp, and the difference is your duration. However, since most users move the app to the background while using their phones (scrolling Instagram, opening Snapchat, and so on), Firebase data sets offer another way to calculate a user’s active duration.
Firebase counts how long users keep the app in the foreground and sends the total duration in milliseconds inside event_params, in a key named engagement_time_msec. We can sum engagement_time_msec per user, which gives each user’s total active session duration.
df$engagement_time_msec=as.numeric(df$engagement_time_msec)
total_session_duration_per_user <- df %>%
group_by(user_pseudo_id) %>%
summarize(total_engagement_time = sum(engagement_time_msec,na.rm = TRUE))
total_session_duration_per_user$total_engagement_time=total_session_duration_per_user$total_engagement_time/60000
hist(total_session_duration_per_user$total_engagement_time,breaks=50)
Our data manipulation has worked, and we are now capable of printing some histograms. However, submitting such a histogram can get you fired, or cause you to receive lower grades; for this reason we will make a histogram with beautiful aesthetics, and definitely better breaks.
breaks <- c(0, 1, 2, 3, 5, 10, 20, 60, 1440)
total_session_duration_per_user$break_group <- cut(total_session_duration_per_user$total_engagement_time, breaks = breaks, labels = FALSE, include.lowest = TRUE)
custom_labels <- c("0-1", "1-2", "2-3", "3-5", "5-10", "10-20", "20-60", "60-1 day", "1 day+")
ggplot(total_session_duration_per_user, aes(x = factor(break_group))) +
geom_bar(fill = "blue", color = "black", alpha = 0.7) +
scale_x_discrete(breaks = seq_along(breaks) - 1, labels = custom_labels) +
labs(title = "Total Session Duration Histogram",
x = "Total Session Duration (minutes)",
y = "Frequency")
Total session duration per user is visualized with comprehensible breaks. Gamers mostly prefer playing and closing the game in short durations; however, we suspect that the excessively long sessions are due to time when the game is running in the background and not actively being played.
distinct_sessions_per_user <- df %>%
group_by(user_pseudo_id) %>%
summarize(distinct_session_count = n_distinct(ga_session_id))
ggplot(distinct_sessions_per_user, aes(x = distinct_session_count)) +
geom_histogram(fill = "blue", color = "black", alpha = 0.7, bins = 20) +
labs(title = "Distinct Sessions per User Histogram",
x = "Distinct Session Count",
y = "Frequency")
How often players reopened the game within the range of our data is in line with the results of our previous analysis. Mostly, distinct players seem to have opened the game 1 or 2 times. There are also users who have opened more than 20 sessions, which may be considered outliers.
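The distinct-session count above can also be reproduced in base R; here is a toy sketch with hypothetical user and session IDs:

```r
# Each row is an event; a session can appear in several events of a user.
events <- data.frame(
  user    = c("u1", "u1", "u1", "u2"),
  session = c("s1", "s1", "s2", "s3"),
  stringsAsFactors = FALSE
)
# Base-R counterpart of group_by(user_pseudo_id) + n_distinct(session):
sessions_per_user <- tapply(events$session, events$user,
                            function(s) length(unique(s)))
sessions_per_user  # u1 opened 2 distinct sessions, u2 opened 1
```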
df$user_first_touch_timestamp = as.numeric(df$user_first_touch_timestamp)
df$event_timestamp = as.numeric(df$event_timestamp)
user_ages <- df %>%
group_by(user_pseudo_id) %>%
summarise(time_difference = max(event_timestamp) - max(user_first_touch_timestamp))
user_ages$time_difference= user_ages$time_difference/60000 # TODO: let's do an analysis related to this!
event_counts <- df %>%
count(event_name)
ggplot(event_counts, aes(x = reorder(event_name, -n), y = n)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Event Counts",
x = "Event Name",
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
We explained how we parsed the events in the first part. To get a preliminary idea about the diversity and distribution density of events in our data set, we prepared a bar plot. Although event data are recorded under many different types, level_end has the largest share. This result will strengthen our interpretations in future analyses.
df$mobile_brand_name= as.factor(df$mobile_brand_name)
df$continent= as.factor(df$continent)
distinct_users <- df %>%
select(user_pseudo_id, continent) %>%
distinct()
ggplot(distinct_users, aes(x = continent)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Distribution of Distinct Users Across Continents",
x = "Continent",
y = "Number of Distinct Users") +
theme_minimal()

Our first research question was: How does the distribution of players vary across continents and countries in the game, and are there noticeable patterns in user engagement based on continent?
To answer this question, distinct users are selected first, and a bar plot of distinct users by continent is created. Users from Asia contribute the most to the intercontinental distribution, with about 175 users, while Europe and the Americas each contribute fewer than 25.
users_per_country <- df %>%
group_by(continent, country) %>%
summarise(distinct_user_count = n_distinct(user_pseudo_id))
users_per_country_filtered <- users_per_country %>%
filter(distinct_user_count > 3)
users_per_country_filtered <- users_per_country_filtered %>%
arrange(desc(distinct_user_count)) %>%
mutate(country_display = ifelse(country == "Türkiye", "Turkey", country))
ggplot(users_per_country_filtered, aes(x = reorder(country_display, distinct_user_count), y = distinct_user_count, size = distinct_user_count, color = continent)) +
geom_point() +
scale_size_continuous(name = "Distinct User Count", range = c(5, 17)) +
labs(title = "Bubble Chart of Distinct Users by Country",
x = "Country",
y = "Distinct User Count",
color = "Continent") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

When we examine the question with a more general visualization, the blue-colored Asian countries are dominant in both diversity and frequency (colors indicate continental differences). Bubble size reflects the magnitude of the distinct user count. More than 30 distinct users from each of Egypt and Turkey have actively engaged with the game.
max_level_data <- df %>%
group_by(user_pseudo_id, continent) %>%
summarize(max_level = max(level, na.rm = TRUE))
ggplot(max_level_data, aes(x = continent, y = max_level,fill = continent)) +
geom_boxplot() +
labs(title = "Box Plot of Maximum Level by Continent",
x = "Continent",
y = "Max Level"
) +
theme_minimal()

The second research question was: What is the distribution of the maximum level attained by users, analyzed in conjunction with their respective continents? Are there significant variations in the maximum level achieved, and do these patterns differ across continents?
The maximum level reached by each user is computed first. Then, to answer the question, a box plot by continent is created. Although the distributions look similar overall, there are more outlying users in Africa and Asia. One reason is the uneven distribution of users across continents, as seen in the previous graphs. Nevertheless, the gap between the 1st and 3rd quartiles on the African continent is striking, and Asia has the highest median value.
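The quartile comparison described here can also be computed numerically rather than read off the box plot. A base-R sketch, using toy max-level values instead of the real `max_level_data` table:

```r
# Sketch: per-continent median and IQR of users' maximum level. Toy data only.
max_level_data <- data.frame(
  continent = c("Asia", "Asia", "Asia", "Africa", "Africa", "Europe"),
  max_level = c(30, 12, 50, 5, 40, 18)
)
stats <- aggregate(max_level ~ continent, data = max_level_data,
                   FUN = function(x) c(median = median(x), iqr = IQR(x)))
stats  # one row per continent, with median and interquartile range
```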
distinct_users_category <- df %>%
select(user_pseudo_id, category) %>%
distinct()
ggplot(distinct_users_category, aes(x = category)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Distribution of Distinct Users Across Categories",
x = "Category",
y = "Number of Distinct Users") +
theme_minimal()

Research Question 3 asks: How does user engagement differ across mobile device categories in the game, and are there variations in the maximum levels achieved by users within each category?
This analysis seeks to uncover patterns and disparities in user behavior across device categories. Distinct users and their associated categories are selected, and it is observed that users generally prefer playing on their mobile phones.
max_level_data_category <- df %>%
group_by(user_pseudo_id, category) %>%
summarize(max_level = max(level, na.rm = TRUE))
ggplot(max_level_data_category, aes(x = category, y = max_level,fill = category)) +
geom_boxplot() +
labs(title = "Box Plot of Maximum Level by Category",
x = "Category",
y = "Max Level",
fill = "Category") +
theme_minimal()

When comparing users' maximum levels across categories with box plots, no significant difference is observed.
success_data <- df %>%
filter(event_name == "level_end" & success == 1)
leveltrialcountbyuser_wsuccess <- success_data %>%
group_by(user_pseudo_id, level) %>%
summarise(trials = n())
leveltrialcountbyuser_wsuccess <- leveltrialcountbyuser_wsuccess %>%
group_by(level) %>%
summarise(distinct_user_count = n(), trials = sum(trials))
ggplot(leveltrialcountbyuser_wsuccess, aes(x = level, y = distinct_user_count, fill = factor(level))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Frequency of Trials at Each Level",
x = "Level",
y = "Frequency",
fill = "Level") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")

The final research question is: What is the distribution of successful trial frequencies across the game's levels? As a continuation of the research on levels, we can comment on whether players go back and replay the same level, and which levels are passed more easily. The points where the decline in the graph breaks are around levels 17 and 26; after level 50, the total number of successful attempts drops below 15.
leveltrialcountbyuser =df |> filter(event_name=="level_end") |>
group_by(level) |> count(level)
levelsuccesscountbyuser =df |> filter(event_name=="level_end" & success=="1") |>
group_by(level) |>count(level)
colnames(levelsuccesscountbyuser)[2]="success"
merged_data <- merge(leveltrialcountbyuser, levelsuccesscountbyuser, by = "level", all = TRUE)
merged_data$success <- ifelse(is.na(merged_data$success), 0, merged_data$success)
head(merged_data)
  level    n success
1     1 5213       0
2     2  127     117
3     3  189     109
4     4  219     100
5     5  268      93
6     6  354      81
merged_data[,"success_rate"]<- merged_data$success/merged_data$n
merged_data$level=merged_data$level-1
ggplot(merged_data[merged_data$level<10 &merged_data$level!=0 ,], aes(x = level, y = success_rate)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black", width = 0.7) +
labs(x = "Level", y = "Success Rate", title = "Funnel Plot of Success Rate by Level") +
theme_minimal() +
scale_x_continuous(breaks = 1:9) # one break per plotted level

To understand how users' success varies between levels, we calculated each level's success rate. In our data set, the success column is 1 when the user completes a level and 0 when unsuccessful. We computed the success rate by dividing the number of successes by the total number of attempts. As expected, level 1 appears to be the easiest level to pass.
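The computation above can be restated compactly in base R. The sketch below reuses the `head(merged_data)` rows shown earlier as toy input:

```r
# Sketch: success rate = successful completions / total level_end events.
# Input rows mirror the head(merged_data) output shown above (level 1 omitted).
merged_data <- data.frame(
  level   = 2:6,
  n       = c(127, 189, 219, 268, 354),
  success = c(117, 109, 100, 93, 81)
)
merged_data$success_rate <- merged_data$success / merged_data$n
# Shift levels down by one, as in the chunk above: a level_end event recorded
# at "level k" marks completion of level k - 1.
merged_data$level <- merged_data$level - 1
round(merged_data$success_rate[merged_data$level == 1], 2)  # 0.92
```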
ABOUT US
Altan Sener kertmeyenkeleilekertenkele@gmail.com
Abdullah Mert Celikkol
Alican Aktag alcnaktag@gmail.com
Feel free to ask any questions you may have.